Bellabeat a high-tech company that manufactures health-focused smart products for women.

1 Ask

1.1 TASKS

In This Phase I Want To :

  • discover of smart device data to gain insight of what consumers prefer.

  • Improve marketing strategy for the bellabeat company.


2 Prepare

  • In this phase i want to try download the data to start my cleaning .
  • I will use R language due to their capability to cleaning and analysis and visualization and documentation in one program “RStudio

2.1 Properties of the data :

  • This data is public open soruce which can any one download it from kaggle to download or discover it click here
  • This is related of smart device data with 30 user share their information for analysis
  • This data contains 18 file with csv format with related to our process.

The available information is related with 3 time variables,which is day, the minute, and the second To simplify the analysis, I will choose the files related to the day because they store information for a longer time and therefore more accuracy

2.2 First Look Of The Datasets

2.2.1 Daily activity

Table 1: Daily activity
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
1503960366 4/12/2016 13162 8.50 8.50 0 1.88 0.55 6.06 0 25 13 328 728 1985
1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.69 4.71 0 21 19 217 776 1797
1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.40 3.91 0 30 11 181 1218 1776
1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83 0 29 34 209 726 1745
1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.41 5.04 0 36 10 221 773 1863

2.2.2 Daily calories

Table 1: Daily calories
Id ActivityDay Calories
1503960366 4/12/2016 1985
1503960366 4/13/2016 1797
1503960366 4/14/2016 1776
1503960366 4/15/2016 1745
1503960366 4/16/2016 1863

2.2.3 Daily intensities

Table 1: daily_calories
Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
1503960366 4/12/2016 728 328 13 25 0 6.06 0.55 1.88
1503960366 4/13/2016 776 217 19 21 0 4.71 0.69 1.57
1503960366 4/14/2016 1218 181 11 30 0 3.91 0.40 2.44
1503960366 4/15/2016 726 209 34 29 0 2.83 1.26 2.14
1503960366 4/16/2016 773 221 10 36 0 5.04 0.41 2.71

2.2.4 Daily steps

Table 1: Daily Daily steps
Id ActivityDay StepTotal
1503960366 4/12/2016 13162
1503960366 4/13/2016 10735
1503960366 4/14/2016 10460
1503960366 4/15/2016 9762
1503960366 4/16/2016 12669

2.2.5 Daily sleep

Table 1: Daily sleep
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
1503960366 4/12/2016 12:00:00 AM 1 327 346
1503960366 4/13/2016 12:00:00 AM 2 384 407
1503960366 4/15/2016 12:00:00 AM 1 412 442
1503960366 4/16/2016 12:00:00 AM 2 340 367
1503960366 4/17/2016 12:00:00 AM 1 700 712

2.2.6 Weight

Table 1: Daily Weight
Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65 True 1.462234e+12
1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65 True 1.462320e+12
1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54 False 1.460510e+12
2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45 True 1.461283e+12
2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69 True 1.463098e+12

2.3 Existing problems

We can see there some problems with this data needed to be fixed which are :

  • The format date is inconsistent
  • The header of the column need to be cleaning
  • Some of tables column have the same name which need to be merged
  • There are some null values which need to be deleted
  • Some of dates contains of both date and time, which need to be fixed

3 Process

In this phase i will cleaning and processing my data to make sure is ready to analysis and make decision after that.

3.1 Set up the libraries

library(tidyverse)
library(kableExtra)   
library(scales)   
library(highcharter) 
library(RColorBrewer) 

3.2 Overview

First let’s take a look of data attributes

str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_calories)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_intensities)
## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
str(daily_steps)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
str(daily_sleep)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
str(weight_info)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

As we can see there are Incorrect types of variables

  • the dates in all tables are (chr) which needed to be converted into (date) .
  • In weight_info table need to remove e scientific notation from Logid column

3.3 Cleaning

3.3.1 Fix the type of the Data

daily_activity<-daily_activity%>%mutate(ActivityDate = as.Date(ActivityDate,format = "%m/%d/%y")) 

daily_calories<-daily_calories%>%mutate(ActivityDay = as.Date(ActivityDay,format = "%m/%d/%y"))

daily_intensities<-daily_intensities%>%mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%y"))

daily_steps<-daily_steps%>% mutate(ActivityDay = as.Date(ActivityDay,format = "%m/%d/%y"))

daily_sleep<-daily_sleep%>%mutate(SleepDay = as.Date(SleepDay,format="%m/%d/%y"))

weight_info<-weight_info%>%mutate(Date = as.Date(Date,format="%m/%d/%y"))


options(scipen = 999)

3.3.2 cleaning columns and rename it

daily_activity<-daily_activity %>% rename(date = ActivityDate)%>% clean_names()

daily_calories<-daily_calories %>% rename(date = ActivityDay) %>% clean_names()

daily_intensities<-daily_intensities %>% rename(date = ActivityDay)%>% clean_names()

weight_info<-weight_info %>% rename(date = Date)%>% clean_names()

daily_steps<-daily_steps %>% rename(date = ActivityDay,TotalSteps=StepTotal)%>% clean_names()

daily_sleep<-daily_sleep %>% rename(date = SleepDay)%>% clean_names()


Let see our data how’s look


3.3.2.1 Daily activity

Table 1: Daily activity
id date total_steps total_distance tracker_distance logged_activities_distance very_active_distance moderately_active_distance light_active_distance sedentary_active_distance very_active_minutes fairly_active_minutes lightly_active_minutes sedentary_minutes calories
1503960366 2020-04-12 13162 8.50 8.50 0 1.88 0.55 6.06 0 25 13 328 728 1985
1503960366 2020-04-13 10735 6.97 6.97 0 1.57 0.69 4.71 0 21 19 217 776 1797
1503960366 2020-04-14 10460 6.74 6.74 0 2.44 0.40 3.91 0 30 11 181 1218 1776
1503960366 2020-04-15 9762 6.28 6.28 0 2.14 1.26 2.83 0 29 34 209 726 1745

3.3.2.2 Daily calories

Table 1: Daily calories
id date calories
1503960366 2020-04-12 1985
1503960366 2020-04-13 1797
1503960366 2020-04-14 1776
1503960366 2020-04-15 1745


3.3.2.3 Daily intensities

Table 1: daily_calories
id date sedentary_minutes lightly_active_minutes fairly_active_minutes very_active_minutes sedentary_active_distance light_active_distance moderately_active_distance very_active_distance
1503960366 2020-04-12 728 328 13 25 0 6.06 0.55 1.88
1503960366 2020-04-13 776 217 19 21 0 4.71 0.69 1.57
1503960366 2020-04-14 1218 181 11 30 0 3.91 0.40 2.44
1503960366 2020-04-15 726 209 34 29 0 2.83 1.26 2.14

3.3.2.4 Daily steps

Table 1: Daily Daily steps
id date total_steps
1503960366 2020-04-12 13162
1503960366 2020-04-13 10735
1503960366 2020-04-14 10460
1503960366 2020-04-15 9762

3.3.2.5 Daily sleep

Table 1: Daily sleep
id date total_sleep_records total_minutes_asleep total_time_in_bed
1503960366 2020-04-12 1 327 346
1503960366 2020-04-13 2 384 407
1503960366 2020-04-15 1 412 442
1503960366 2020-04-16 2 340 367

3.3.2.6 Weight

Table 1: Daily Weight
id date weight_kg weight_pounds fat bmi is_manual_report log_id
1503960366 2020-05-02 52.6 115.9631 22 22.65 True 1462233599000
1503960366 2020-05-03 52.6 115.9631 NA 22.65 True 1462319999000
1927972279 2020-04-13 133.5 294.3171 NA 47.54 False 1460509732000
2873212765 2020-04-21 56.7 125.0021 NA 21.45 True 1461283199000


3.3.3 merge the data


as we can see the data almost cleaning but now we want to merge the data to complete my cleaning and analysis it later

I will going to merge five Dataframes
1-daily_activity
2-daily_calories
3-daily_intensities
4-daily_steps
5-daily_sleep

because :

  • they have common column names
  • they have sufficient of users Id’s for being analysis
fisrt_list <- list(daily_activity, daily_calories)
fisrt_list<-fisrt_list %>% reduce(full_join, by=c("id","date","calories"))

second_list <- list(fisrt_list, daily_intensities)
second_list<-second_list %>% reduce(full_join, by=c("id"
,"date","sedentary_minutes","lightly_active_minutes"
,"fairly_active_minutes","very_active_minutes"
,"sedentary_active_distance","light_active_distance"
,"moderately_active_distance","very_active_distance"))

Third_list <- list(second_list , daily_steps)
Third_list<-Third_list %>% reduce(full_join, by=c("id","date","total_steps"))

Fourth_list <- list(Third_list , daily_sleep)
Fourth_list<-Fourth_list %>% reduce(full_join, by=c("id","date"))

weight_info <- list(Fourth_list , weight_info)
weight_info<-weight_info %>% reduce(full_join, by=c("id","date"))

3.3.4 Delete duplicates and null values

First I will see If there any nulls or duplicated values

sum(is.na(Fourth_list)) ## TO SEE HOW NULL VALUSE WE GET
## [1] 1590
sum(duplicated(Fourth_list)) ## TO SEE HOW duplicated ROWS WE GET
## [1] 3
Fourth_list<-Fourth_list[!duplicated(Fourth_list), ] ##Delete duplicate Rows
weight_info<-weight_info[!duplicated(Fourth_list), ] ##Delete duplicate Rows


Fourth_list<-drop_na(Fourth_list) ##Delete Nulls Value

weight_info<- subset( weight_info, select = -fat ) ##For delete fat
weight_info<-drop_na(weight_info) ##Delete Nulls Value

weight_info$weekday<- weekdays(weight_info$date) ## add column
Fourth_list$weekday<- weekdays(Fourth_list$date)  ## add column

3.3.4.1 Final List Of The Merged DataFrame

3.3.4.2 Weight Info DataFrame


4 Analyze


Figure : Step Count & Mortality Risk Data. The benefits of walking and running are the greatest between 2,700 and 4,400 steps. Women who got between 4,400 and 5,900 steps per day had a 46% reduction in their risk of death. Women who got between 5,900 and 8,400 steps per day had a 53% reduction in their risk of death. Women who got more than 8,400 steps per day had a 66% reduction in their risk of death. Bars represent overall mortality. All values are compared to women who got fewer than 2,700 steps per day..

reference click


so i will going to determine the total of the steps of each person in this data based in the figure above

4.1 Total steps per day !

  • i will combine the values in Total Steps based on conditions winch are :
    total step < 2700 is baseline
    TotalSteps >= 2700 and < 5900 is -46% Mortailty Rate <
    TotalSteps >= 5900 and < 8500 is -53% Mortailty Rate
    TotalSteps >= 8500 is -66% Mortailty Rate
count_steps<-Fourth_list %>% 
  mutate(total_steps = 
           case_when(
             total_steps < 2700 ~ "Baseline",
             total_steps >= 2700 & total_steps < 5900 ~ " -46% Mortailty Rate ",
             total_steps >= 5900 & total_steps < 8500 ~ "-53% Mortailty Rate",
             total_steps >= 8500 ~ "-66% Mortailty Rate"
           )
  )




count_steps <- count_steps  %>% 
  count(total_steps) %>% 
  arrange(n)

 count_steps %>%
  e_charts(x = total_steps) %>%
  e_bar(n, name = " Total of Mortailty Rate") %>%
    e_toolbox_feature(
  feature = "magicType",
  type = list("line", "bar")
  )  %>%
  e_theme("chalk") %>%
  e_theme_custom('{"color":["#ff715e","#ffaf51"]}') %>%
     e_tooltip(trigger = "axis")


as you can see above
much of users are active and has high steps rate more 8500 steps per day !


4.2 Is days of the week effects on steps rate ?

Fourth_list$weekday<- weekdays(Fourth_list$date) 

fig <- Fourth_list %>%
  plot_ly(
    x = ~total_steps, 
    y = ~calories,
    color = ~calories ,
    size = ~ calories,
    frame = ~weekday, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  )
fig <- fig %>% layout(
  xaxis = list(
    type = "log")) 
fig

as you can see in the above graph there’s two things :

  • the step rate is almost the same except some user at Saturday and Thursday have more activity than the other days

  • the calories is almost the same except some user at Saturday and Thursday and Tuesday have more activity than the other days

  • we can also see many of the user have higher steps rate but low calories !? so in this case i want to discover what are the another factors that affect on the calories .


Fourth_list |>
  e_charts() |>
  e_boxplot(very_active_minutes) |>
  e_boxplot(lightly_active_minutes) %>%
  e_boxplot(sedentary_minutes ) %>%
  e_boxplot(fairly_active_minutes) %>%
 e_theme_custom('{"color":["#ff715e","#ffaf51"]}') 


Fourth_list%>%
  e_charts(calories ) %>%
  e_bar(very_active_distance) %>% 
  e_bar(light_active_distance) %>% 
  e_bar(logged_activities_distance) %>% 
  e_bar(moderately_active_distance) %>% 
 
  e_tooltip()%>%
  e_datazoom(start = 50) %>%
  e_toolbox_feature(
    feature = "magicType",
    type = list("line", "bar"))
the highest calories burned in active distance and lowest calories burned in lightly distance


sleep_per_hours<-Fourth_list %>% 
  mutate(total_minutes_asleep = 
           case_when(
             total_minutes_asleep < 420  ~ "Less than 7h",
             total_minutes_asleep <= 540 ~ "7h to 9h",
             total_minutes_asleep > 540  ~ "More than 9h"
           )
  )


sleep_per_hours <- sleep_per_hours  %>% 
  count(total_minutes_asleep) %>% 
  arrange(n)

sleep_per_hours%>% 
  e_charts(x = total_minutes_asleep) %>% 
  e_pie( n , radius = c("50%", "70%"))  %>% 
  e_title("Total  time of sleep ",
          textAlign  = "center",
          left ="50%") %>%
  e_labels(show = TRUE,
           formatter = "{c} \n {d}%",
           position = "inside") %>%
  e_legend(right = 0, 
           orient = "vertical")

4.3 Sleep

Fourth_list %>% 
  select(total_minutes_asleep, total_steps) %>% 
  mutate(sleep_level = case_when(
    total_minutes_asleep < 420  ~ "Less than 7h",
    total_minutes_asleep <= 540 ~ "7h to 9h",
    total_minutes_asleep > 540  ~ "More than 9h")) %>% 
  
  mutate(steps_level = case_when(
    total_steps < 2700 ~ "less than 2700 steps ",
    total_steps >= 2700 & total_steps < 5900 ~ " between 2700 and 5900 steps ",
    total_steps >= 5900 & total_steps < 8500 ~ "between 5900 and 8500 steps",
    total_steps >= 8500 ~ "more than 8500 steps")) %>%
  
  select(-c(total_minutes_asleep, total_steps)) %>% 
  group_by(sleep_level, steps_level) %>% 
  summarise(counts = n()) %>% 
  ggplot(aes(x = sleep_level, 
             y = counts, 
             fill = sleep_level)) +
  geom_bar(stat = "identity") +
  theme_classic() +
  scale_fill_manual(values=c("#4285F4", "#DB4437", "#F4B400")) +
  facet_wrap(~steps_level, nrow = 1) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(strip.text = element_text(colour = 'black', size = 8)) +
  theme(strip.background = element_rect(fill = "beige", color = 'black'))+
  labs(
    title = "Sleep  vs. steps",
    x = "Sleep level",
    y = "Count"
  )


here we can see in this graph the people who have more than 8500 steps rate has sleep level with less than 7 hours


4.4 Weight

<

weight_info %>% 
  select(weight_kg, total_steps) %>% 
  mutate(steps_level = case_when(
    total_steps < 2700 ~ "less than 2700 steps ",
    total_steps >= 2700 & total_steps < 5900 ~ " between 2700 and 5900 steps ",
    total_steps >= 5900 & total_steps < 8500 ~ "between 5900 and 8500 steps",
    total_steps >= 8500 ~ "more than 8500 steps")) %>%
  mutate(weight_level = case_when(
    weight_kg <60 ~ "Less 60 kg ",
    weight_kg > 60 ~ "more than 60 kg")) %>%
  
 select(-c(total_steps, weight_kg)) %>% 
  group_by(steps_level,weight_level) %>% 
  summarise(counts = n()) %>% 
  ggplot(aes(x = weight_level, y =  counts, fill = weight_level)) +
  geom_bar(stat = "identity") +
  theme_classic() +
  scale_fill_manual(values=c("#4285F4","#0F9D58")) +
  facet_wrap(~steps_level, nrow = 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
   theme(legend.position = "none") +
  theme(strip.text = element_text(colour = 'black', size = 8)) +
  theme(strip.background = element_rect(fill = "beige", color = 'black'))+
  labs(title = "User weight vs. steps rate",
    x = " weight ",
    y = "Count")

we can see two things

  • there more activity in users with more 60 Kg a then of users less of 60 kg

  • the user in more 60 kg has much steps rate more than 8500 steps



5 My conclusion

my final conclusion based on my analysis is :

  • more of users that share their information with more than 60 kg has much steps rate than people with less than 60 kg .
    my recommendation is : targeting young people who have less than 60 kg to motivate them to participate with walking

  • people with less than 7h of sleep has much steps of rate .
    my recommendation is : we can invite them to take a rest and get better sleep rate

  • people who have lowest calories burned have lightly distance.
    my recommendation is : we can make people with “lightly distance” to decrease their “lightly distance” by put limited time beween the take rest and the walk .

  • some of people has higher steps rate but low calories
    my recommendation is : due to the lightly active minutes so we need to put notification in our smart watch to notification users about “lightly time minutes” when it repeat many times . .